A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can
analyse all relevant customer data and develop focused customer retention programs.
Each row represents a customer, each column contains customer’s attributes described on the column Metadata. The data set includes information about:
• Customers who left within the last month – the column is called Churn
• Services that each customer has signed up for – phone, multiple lines, internet, online security,
online backup, device protection, tech support, and streaming TV and movies
• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly
charges, and total charges
• Demographic info about customers – gender, age range, and if they have partners and dependents
Build a model that will help to identify the potential customers who have a higher probability to churn.
This help the company to understand the pinpoints and patterns of customer churn and will increase the focus on strategising customer retention
• Import all the given datasets from MYSQL server. Explore shape and size.
• Merge all datasets onto one and explore final shape and size
import numpy as np
import pandas as pd
# pre-processing method
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import pymysql
import pandas as pd
# methods and classes for evaluation
from sklearn import metrics
from math import sqrt
from sklearn.model_selection import cross_validate
import time
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
# methods and classes for evaluation
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import classification_report
#import pydotplus as pydot
from sklearn import tree
from os import system
import scipy.stats as stats
from scipy.stats import chi2_contingency
db_connection_str = 'mysql+pymysql://root:1q2w3rto098@localhost:3306/telecomdata'
db_connection = create_engine(db_connection_str)
#reading customer churn dataset #1
cust_Churn_df_Raw1 = pd.read_sql('SELECT * FROM telecomdata.telcomcustomerchurn1', con=db_connection)
print("size of dataset #1 :", cust_Churn_df_Raw1.shape)
size of dataset #1 : (7043, 10)
There are only 10 column in this dataset, we need to concatenate other dataset with this one.
#reading customer churn dataset #2
cust_Churn_df_Raw2 = pd.read_sql('SELECT * FROM telecomdata.telcomcustomerchurn2', con=db_connection)
print("size of dataset #2 :", cust_Churn_df_Raw1.shape)
size of dataset #2 : (7043, 10)
cust_Churn_df_Raw1.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity'],
dtype='object')
# merge / concatenate data set based on axis (axis = 1)
cust_Churn_df_Raw_Concat = pd.concat([cust_Churn_df_Raw1, cust_Churn_df_Raw2], axis=1)
It is noticed that, there are 2 ways to import the the provided data. In below cell we import data using CSV format and compare it with mysql data.
#reading CSV file which is provided along side above data set to compare the number of recods imported
cust_Churn_df_Raw_CSV = pd.read_csv('TelcomCustomer-Churn.csv')
cust_Churn_df_Raw_CSV[['customerID','gender','SeniorCitizen','Partner','Dependents','tenure','PhoneService','MultipleLines','InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','PaymentMethod','MonthlyCharges','Churn']].equals(cust_Churn_df_Raw_Concat[['customerID','gender','SeniorCitizen','Partner','Dependents','tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges','Churn']])
True
True signifies that both data set are equal. The data import is fine !
shape_of_dataset = cust_Churn_df_Raw_Concat.shape
print("Number of rows in dataset ", shape_of_dataset[0])
print("Number of columns in dataset ", shape_of_dataset[1])
Number of rows in dataset 7043 Number of columns in dataset 21
print("Shape of dataset ", cust_Churn_df_Raw_Concat.size)
Shape of dataset 147903
cust_Churn_df_Raw_Concat.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
#making copy of dataset so it can be used for data analysis and visualization
cust_Churn_df_Raw_Concat_Copy_DataAnalysis = cust_Churn_df_Raw_Concat.copy()
cust_Churn_df_Raw_Concat_Copy_DataVisualization = cust_Churn_df_Raw_Concat.copy()
cust_Churn_df_Raw_Concat.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
• Missing value treatment
• Convert categorical attributes to continuous using relevant functional knowledge
• Drop attribute/s if required using relevant functional knowledge
• Automate all the above steps
cust_Churn_df_Raw_Concat.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 11 Churn 0 dtype: int64
It is mentioned to automate the task of finding and imputing missing values, below mentioned line helps achieve that
column_list_Null_values = cust_Churn_df_Raw_Concat.columns[cust_Churn_df_Raw_Concat.isna().any()].tolist()
for i in column_list_Null_values:
cust_Churn_df_Raw_Concat[i].fillna(cust_Churn_df_Raw_Concat[i].mean(), inplace=True)
Whereever we find the column as null in the dataset it is replaced with mean value. Here we loop through all columns which are having NA as values and replace them, there by automating once we get new data.
cust_Churn_df_Raw_Concat.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 0 Churn 0 dtype: int64
#Checking the data type of the columns
cust_Churn_df_Raw_Concat.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 gender 7043 non-null object 2 SeniorCitizen 7043 non-null int64 3 Partner 7043 non-null object 4 Dependents 7043 non-null object 5 tenure 7043 non-null int64 6 PhoneService 7043 non-null object 7 MultipleLines 7043 non-null object 8 InternetService 7043 non-null object 9 OnlineSecurity 7043 non-null object 10 OnlineBackup 7043 non-null object 11 DeviceProtection 7043 non-null object 12 TechSupport 7043 non-null object 13 StreamingTV 7043 non-null object 14 StreamingMovies 7043 non-null object 15 Contract 7043 non-null object 16 PaperlessBilling 7043 non-null object 17 PaymentMethod 7043 non-null object 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null float64 20 Churn 7043 non-null object dtypes: float64(2), int64(2), object(17) memory usage: 1.1+ MB
for i in cust_Churn_df_Raw_Concat.columns:
if(cust_Churn_df_Raw_Concat[i].dtype == np.object and i != 'customerID'):
print("Column ::", i , " :: " , cust_Churn_df_Raw_Concat[i].unique())
Column :: gender :: ['Female' 'Male'] Column :: Partner :: ['Yes' 'No'] Column :: Dependents :: ['No' 'Yes'] Column :: PhoneService :: ['No' 'Yes'] Column :: MultipleLines :: ['No phone service' 'No' 'Yes'] Column :: InternetService :: ['DSL' 'Fiber optic' 'No'] Column :: OnlineSecurity :: ['No' 'Yes' 'No internet service'] Column :: OnlineBackup :: ['Yes' 'No' 'No internet service'] Column :: DeviceProtection :: ['No' 'Yes' 'No internet service'] Column :: TechSupport :: ['No' 'Yes' 'No internet service'] Column :: StreamingTV :: ['No' 'Yes' 'No internet service'] Column :: StreamingMovies :: ['No' 'Yes' 'No internet service'] Column :: Contract :: ['Month-to-month' 'One year' 'Two year'] Column :: PaperlessBilling :: ['Yes' 'No'] Column :: PaymentMethod :: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)' 'Credit card (automatic)'] Column :: Churn :: ['No' 'Yes']
Above columns looks to be categorical. PaymentMethod and gender can be OneHotEncoded, rest can be converted using dummies function since these column values can be graded - as having some service enabled or not enabled.
feature_corr = cust_Churn_df_Raw_Concat.corr()
feature_corr
| SeniorCitizen | tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|---|
| SeniorCitizen | 1.000000 | 0.016567 | 0.220173 | 0.102395 |
| tenure | 0.016567 | 1.000000 | 0.247900 | 0.824757 |
| MonthlyCharges | 0.220173 | 0.247900 | 1.000000 | 0.650468 |
| TotalCharges | 0.102395 | 0.824757 | 0.650468 | 1.000000 |
We observe good correlation between MonthlyCharges and TotalCharges
We observe good correlation between Tenure and TotalCharges
replaceStructChurn = {"Churn":{"No":0,"Yes":1}}
print(cust_Churn_df_Raw_Concat['tenure'].corr(cust_Churn_df_Raw_Concat[['Churn']].replace(replaceStructChurn).squeeze()))
print(cust_Churn_df_Raw_Concat['MonthlyCharges'].corr(cust_Churn_df_Raw_Concat[['Churn']].replace(replaceStructChurn).squeeze()))
print(cust_Churn_df_Raw_Concat['TotalCharges'].corr(cust_Churn_df_Raw_Concat[['Churn']].replace(replaceStructChurn).squeeze()))
-0.3522286701130779 0.19335642223784705 -0.19942771508831378
Ther existing continous columns are having good correlation with the Class variable.
# Using ChiSquare we determine importance of the variable - if the variable if not important - then we can plan to drop the variable.
alpha=0.05
dropColumnsList=[]
def ChiSquareTestIndependence(colX,colY):
X = cust_Churn_df_Raw_Concat[colX].astype(str)
Y = cust_Churn_df_Raw_Concat[colY].astype(str)
crossT = pd.crosstab(Y,X)
chi2, p, dof, expected = stats.chi2_contingency(crossT.values)
if p<alpha:
print("{0} is IMPORTANT for Prediction".format(colX))
else:
print("{0} is NOT an important predictor".format(colX))
dropColumnsList.append(colX)
for var in ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod','Churn']:
ChiSquareTestIndependence(colX=var,colY="Churn" )
gender is NOT an important predictor SeniorCitizen is IMPORTANT for Prediction Partner is IMPORTANT for Prediction Dependents is IMPORTANT for Prediction tenure is IMPORTANT for Prediction PhoneService is NOT an important predictor MultipleLines is IMPORTANT for Prediction InternetService is IMPORTANT for Prediction OnlineSecurity is IMPORTANT for Prediction OnlineBackup is IMPORTANT for Prediction DeviceProtection is IMPORTANT for Prediction TechSupport is IMPORTANT for Prediction StreamingTV is IMPORTANT for Prediction StreamingMovies is IMPORTANT for Prediction Contract is IMPORTANT for Prediction PaperlessBilling is IMPORTANT for Prediction PaymentMethod is IMPORTANT for Prediction Churn is IMPORTANT for Prediction
#what ever columns are determined as not important from above test are removed.
cust_Churn_df_Raw_Concat.drop(dropColumnsList, axis = 1, inplace = True)
droping all the column which are determined as not important in the chiSquare test, there by automating once we get new data.
#having all column which are need to be converted and churn out the column which are insigificant
replaceStruct = {
"Partner":{"No":2,"Yes":1},
"Dependents":{"No":2,"Yes":1},
"PhoneService":{"No":2,"Yes":1},
"MultipleLines": {"No phone service":3,"No":2,"Yes":1},
"InternetService":{"No":3,"DSL":2,"Fiber optic":1},
"OnlineSecurity":{"No internet service":3,"No":2,"Yes":1},
"OnlineBackup":{"No internet service":3,"No":2,"Yes":1},
"DeviceProtection":{"No internet service":3,"No":2,"Yes":1},
"TechSupport":{"No internet service":3,"No":2,"Yes":1},
"StreamingTV":{"No internet service":3,"No":2,"Yes":1},
"StreamingMovies":{"No internet service":3,"No":2,"Yes":1},
"Contract":{"Month-to-month":3,"One year":2,"Two year":1},
"PaperlessBilling":{"No":2,"Yes":1},
"Churn":{"No":0,"Yes":1}
}
oneHotColsTemp=["gender","PaymentMethod"]
oneHotCols = [elt for elt in oneHotColsTemp if elt not in dropColumnsList]
cust_Churn_df_Raw_Concat_Replaced=pd.get_dummies(cust_Churn_df_Raw_Concat, columns=oneHotCols)
cust_Churn_df_Raw_Concat_Replaced=cust_Churn_df_Raw_Concat_Replaced.replace(replaceStruct)
#Check the column data type of the columns
cust_Churn_df_Raw_Concat_Replaced.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 SeniorCitizen 7043 non-null int64 2 Partner 7043 non-null int64 3 Dependents 7043 non-null int64 4 tenure 7043 non-null int64 5 MultipleLines 7043 non-null int64 6 InternetService 7043 non-null int64 7 OnlineSecurity 7043 non-null int64 8 OnlineBackup 7043 non-null int64 9 DeviceProtection 7043 non-null int64 10 TechSupport 7043 non-null int64 11 StreamingTV 7043 non-null int64 12 StreamingMovies 7043 non-null int64 13 Contract 7043 non-null int64 14 PaperlessBilling 7043 non-null int64 15 MonthlyCharges 7043 non-null float64 16 TotalCharges 7043 non-null float64 17 Churn 7043 non-null int64 18 PaymentMethod_Bank transfer (automatic) 7043 non-null uint8 19 PaymentMethod_Credit card (automatic) 7043 non-null uint8 20 PaymentMethod_Electronic check 7043 non-null uint8 21 PaymentMethod_Mailed check 7043 non-null uint8 dtypes: float64(2), int64(15), object(1), uint8(4) memory usage: 1018.1+ KB
All columns are in numerical data type as needed to build the model.
• Perform detailed statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
For continous variable - we shall descirbe to know more about variable and also plot few graphs to know more about behaviour patterns.
cust_Churn_df_Raw_Concat_Copy_DataAnalysis[['tenure','MonthlyCharges','TotalCharges']].describe()
| tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|
| count | 7043.000000 | 7043.000000 | 7032.000000 |
| mean | 32.371149 | 64.761692 | 2283.300441 |
| std | 24.559481 | 30.090047 | 2266.771362 |
| min | 0.000000 | 18.250000 | 18.800000 |
| 25% | 9.000000 | 35.500000 | 401.450000 |
| 50% | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 55.000000 | 89.850000 | 3794.737500 |
| max | 72.000000 | 118.750000 | 8684.800000 |
There are only 3 continous variable in the dataset
Tenure:
75% of data are below value 55, there are no outliers observed in data since max and 75% value are not having too much difference.
Since min value is 0, we need to take closer look at data.
MonthlyCharges:
75% of data are below value 89, there are no outliers observed in data sicne max and 75% value are not having too much difference.
Mean and Median are having difference where in mean value is less then median and there would be right skewness in data.
TotalCharges:
75% of data are below 3795, Since max and 75% are having difference. We need check if outlier exists for the data.
cust_Churn_df_Raw_Concat_Copy_DataAnalysis.corr()
| SeniorCitizen | tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|---|
| SeniorCitizen | 1.000000 | 0.016567 | 0.220173 | 0.102411 |
| tenure | 0.016567 | 1.000000 | 0.247900 | 0.825880 |
| MonthlyCharges | 0.220173 | 0.247900 | 1.000000 | 0.651065 |
| TotalCharges | 0.102411 | 0.825880 | 0.651065 | 1.000000 |
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
fig, ax = plt.subplots(2,2, figsize=(18,18))
listOfColumns = ['MonthlyCharges', 'TotalCharges','tenure']
len_listOfColumns = len(listOfColumns)
index = 0
for j in range(len(ax)):
for i in range(len(ax[j])):
if(index < len_listOfColumns):
ax[j][i] = sns.distplot(cust_Churn_df_Raw_Concat_Copy_DataAnalysis[listOfColumns[index]],bins=20,rug=True,ax = ax[j][i]);
index = index + 1
fig.show()
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
fig, ax = plt.subplots(2,2, figsize=(18,12))
listOfColumns = ['MonthlyCharges', 'TotalCharges','tenure']
len_listOfColumns = len(listOfColumns)
index = 0
for j in range(len(ax)):
for i in range(len(ax[j])):
if(index < len_listOfColumns):
ax[j][i] = sns.boxplot(cust_Churn_df_Raw_Concat_Copy_DataAnalysis[listOfColumns[index]],ax = ax[j][i]);
index = index + 1
fig.show()
For categorical variable - bar graph, count plot would help to know more about behaviour patterns.
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
fig, ax = plt.subplots(2,2, figsize=(18,12))
listOfColumns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents']
len_listOfColumns = len(listOfColumns)
index = 0
for j in range(len(ax)):
for i in range(len(ax[j])):
if(index < len_listOfColumns):
ax[j][i] = sns.countplot(x = listOfColumns[index],data = cust_Churn_df_Raw_Concat_Copy_DataAnalysis,ax = ax[j][i]);
index = index + 1
fig.show()
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
fig, ax = plt.subplots(5,2, figsize=(18,18))
listOfColumns = ['PhoneService', 'MultipleLines', 'InternetService'
,'OnlineSecurity','OnlineBackup', 'DeviceProtection', 'TechSupport'
,'StreamingTV', 'StreamingMovies']
len_listOfColumns = len(listOfColumns)
index = 0
for j in range(len(ax)):
for i in range(len(ax[j])):
if(index < len_listOfColumns):
ax[j][i] = sns.countplot(x = listOfColumns[index],data = cust_Churn_df_Raw_Concat_Copy_DataAnalysis,ax = ax[j][i]);
index = index + 1
fig.show()
fig, ax = plt.subplots(2,2, figsize=(18,12))
listOfColumns = ['Contract', 'PaperlessBilling','PaymentMethod']
len_listOfColumns = len(listOfColumns)
index = 0
for j in range(len(ax)):
for i in range(len(ax[j])):
if(index < len_listOfColumns):
ax[j][i] = sns.countplot(x = listOfColumns[index],data = cust_Churn_df_Raw_Concat_Copy_DataAnalysis,ax = ax[j][i]);
index = index + 1
fig.show()
replaceStructChurn = {"Churn":{"No":0,"Yes":1}}
sns.countplot(x = 'Churn',data = cust_Churn_df_Raw_Concat_Copy_DataAnalysis[['Churn']]);
print(cust_Churn_df_Raw_Concat_Copy_DataAnalysis[['Churn']].value_counts(normalize=True))
Churn No 0.73463 Yes 0.26537 dtype: float64
sns.set(rc={'figure.figsize':(11.7,5.27)})
sns.countplot(x="Contract", hue="Churn", data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis);
sns.countplot(x="PaperlessBilling", hue="Churn", data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis);
sns.countplot(x="PaymentMethod", hue="Churn", data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis);
g = sns.catplot(x="InternetService",
y = "MonthlyCharges",
hue="Churn",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="PhoneService",
y = "MonthlyCharges",
hue="Churn",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="InternetService",
y = "tenure",
hue="Churn",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="PhoneService",
y = "tenure",
hue="Churn",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="PaymentMethod",
y = "tenure",
hue="Churn",
col="PaperlessBilling",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="Contract",
y = "MonthlyCharges",
hue="Churn",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30)
<seaborn.axisgrid.FacetGrid at 0x13c997adc88>
g = sns.catplot(x="PaperlessBilling",
y = "tenure",
hue="Churn",
col="Contract",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
Based on Tenure -
g = sns.catplot(x="PaperlessBilling",
y = "MonthlyCharges",
hue="Churn",
col="Contract",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
Based on Monthly Charges -
g = sns.catplot(x="PaymentMethod",
y = "MonthlyCharges",
hue="Churn",
col="Contract",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
Based on Monthly Charges -
g = sns.catplot(x="PaymentMethod",
y = "tenure",
hue="Churn",
col="Contract",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
Based on Tenure -
g = sns.catplot(x="DeviceProtection",
y = "MonthlyCharges",
hue="Churn",
col="InternetService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="OnlineSecurity",
y = "MonthlyCharges",
hue="Churn",
col="InternetService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="StreamingTV",
y = "MonthlyCharges",
hue="Churn",
col="InternetService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="TechSupport",
y = "MonthlyCharges",
hue="Churn",
col="InternetService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="OnlineSecurity",
y = "tenure",
hue="Churn",
col="InternetService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="MultipleLines",
y = "MonthlyCharges",
hue="Churn",
col="PhoneService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
g = sns.catplot(x="MultipleLines",
y = "tenure",
hue="Churn",
col="PhoneService",
data=cust_Churn_df_Raw_Concat_Copy_DataAnalysis,
kind="swarm");
g.set_xticklabels(rotation=30);
sns.pairplot(cust_Churn_df_Raw_Concat_Copy_DataAnalysis)
<seaborn.axisgrid.PairGrid at 0x13c9ae1e7f0>
corrdata = cust_Churn_df_Raw_Concat_Copy_DataAnalysis[['MonthlyCharges','TotalCharges','tenure']]
corrdata['Churn'] = cust_Churn_df_Raw_Concat_Copy_DataAnalysis[['Churn']].replace(replaceStruct)
plt.figure(figsize=(20,16))
sns.heatmap(corrdata.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu")
plt.show()
D:\anaconda\envs\gl\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
• Segregate predictors vs target attributes
• Check for target balancing and fix it if found imbalanced.
• Perform train-test split.
• Check if the train and test data have similar statistical characteristics when compared with original data
#Checking the column list before segregating.
cust_Churn_df_Raw_Concat_Replaced.columns
Index(['customerID', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'Contract', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges',
'Churn', 'PaymentMethod_Bank transfer (automatic)',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
dtype='object')
cust_Churn_df_Raw_Concat_Replaced['customerID'] = cust_Churn_df_Raw_Concat_Replaced[['customerID']]
cust_Churn_df_Raw_Concat_Replaced['MonthlyCharges'] = cust_Churn_df_Raw_Concat_Replaced[['MonthlyCharges']]
cust_Churn_df_Raw_Concat_Replaced['TotalCharges'] = cust_Churn_df_Raw_Concat_Replaced[['TotalCharges']]
cust_Churn_df_Raw_Concat_Replaced['tenure'] = cust_Churn_df_Raw_Concat_Replaced[['tenure']]
Y = cust_Churn_df_Raw_Concat_Replaced.Churn
X = cust_Churn_df_Raw_Concat_Replaced[
['SeniorCitizen',
'Partner',
'Dependents',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'Contract',
'PaperlessBilling',
'PaymentMethod_Bank transfer (automatic)',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Electronic check',
'PaymentMethod_Mailed check',
'MonthlyCharges',
'TotalCharges',
'tenure']]
Y.value_counts(normalize=True)
0 0.73463 1 0.26537 Name: Churn, dtype: float64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =train_test_split(X, Y, test_size=.30, random_state=22,stratify = Y)
X_train.shape,X_test.shape
((4930, 15), (2113, 15))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
0 0.734686 1 0.265314 Name: Churn, dtype: float64 0 0.734501 1 0.265499 Name: Churn, dtype: float64
X_train.describe()
| SeniorCitizen | Partner | Dependents | MultipleLines | InternetService | OnlineSecurity | Contract | PaperlessBilling | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | MonthlyCharges | TotalCharges | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4930.000000 | 4930.000000 | 4930.000000 | 4930.00000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 | 4930.000000 |
| mean | 0.162880 | 1.520487 | 1.704260 | 1.67789 | 1.772819 | 1.928803 | 2.317241 | 1.407099 | 0.215822 | 0.211765 | 0.338742 | 0.233671 | 64.825811 | 2268.529027 | 31.984787 |
| std | 0.369294 | 0.499631 | 0.456421 | 0.64404 | 0.778797 | 0.705025 | 0.829786 | 0.491344 | 0.411433 | 0.408600 | 0.473330 | 0.423208 | 30.025233 | 2262.950819 | 24.531991 |
| min | 0.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 18.250000 | 18.800000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 35.500000 | 390.387500 | 8.000000 |
| 50% | 0.000000 | 2.000000 | 2.000000 | 2.00000 | 2.000000 | 2.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 70.450000 | 1378.600000 | 28.000000 |
| 75% | 0.000000 | 2.000000 | 2.000000 | 2.00000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 89.850000 | 3778.075000 | 55.000000 |
| max | 1.000000 | 2.000000 | 2.000000 | 3.00000 | 3.000000 | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 118.750000 | 8684.800000 | 72.000000 |
X_test.describe()
| SeniorCitizen | Partner | Dependents | MultipleLines | InternetService | OnlineSecurity | Contract | PaperlessBilling | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | MonthlyCharges | TotalCharges | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.000000 | 2113.00000 | 2113.000000 | 2113.000000 | 2113.000000 |
| mean | 0.160435 | 1.508755 | 1.691434 | 1.668244 | 1.787033 | 1.932797 | 2.291529 | 1.409371 | 0.227165 | 0.226219 | 0.328916 | 0.21770 | 64.612092 | 2317.764742 | 33.272598 |
| std | 0.367096 | 0.500042 | 0.462011 | 0.639765 | 0.779157 | 0.708597 | 0.842867 | 0.491834 | 0.419099 | 0.418481 | 0.469931 | 0.41278 | 30.247329 | 2269.937167 | 24.605726 |
| min | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 18.400000 | 19.100000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 35.550000 | 429.550000 | 9.000000 |
| 50% | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 70.150000 | 1463.450000 | 31.000000 |
| 75% | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 89.900000 | 3801.700000 | 56.000000 |
| max | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 118.350000 | 8564.750000 | 72.000000 |
X.describe()
| SeniorCitizen | Partner | Dependents | MultipleLines | InternetService | OnlineSecurity | Contract | PaperlessBilling | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | MonthlyCharges | TotalCharges | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 |
| mean | 0.162147 | 1.516967 | 1.700412 | 1.674996 | 1.777084 | 1.930001 | 2.309527 | 1.407781 | 0.219225 | 0.216101 | 0.335794 | 0.228880 | 64.761692 | 2283.300441 | 32.371149 |
| std | 0.368612 | 0.499748 | 0.458110 | 0.642730 | 0.778877 | 0.706051 | 0.833755 | 0.491457 | 0.413751 | 0.411613 | 0.472301 | 0.420141 | 30.090047 | 2265.000258 | 24.559481 |
| min | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 18.250000 | 18.800000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 35.500000 | 402.225000 | 9.000000 |
| 50% | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 70.350000 | 1400.550000 | 29.000000 |
| 75% | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 89.850000 | 3786.600000 | 55.000000 |
| max | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 118.750000 | 8684.800000 | 72.000000 |
#Simple model using entropy without any parameter. This model will overfit training data and less accurate on testing data
#below is with gini criterion
model_entropy=DecisionTreeClassifier()
model_entropy.fit(X_train, y_train)
print(model_entropy.score(X_train, y_train)) # performance on train data
print(model_entropy.score(X_test, y_test)) # performance on test data
0.9975659229208925 0.7283483199242783
#Simple model using entropy without any parameter. This model will overfit training data and less accurate on testing data
#below is with entropy criterion
model_entropy=DecisionTreeClassifier(criterion='entropy')
model_entropy.fit(X_train, y_train)
print(model_entropy.score(X_train, y_train)) # performance on train data
print(model_entropy.score(X_test, y_test)) # performance on test data
0.9975659229208925 0.7245622337908187
start = time.process_time()
pruned_classifier = DecisionTreeClassifier(
criterion = "entropy",
random_state = 22,
max_depth=7,
min_samples_leaf=7,
max_features = 12
)
pruned_classifier.fit(X_train, y_train)
predictions_pruned = pruned_classifier.predict(X_test)
predictions_pruned_train = pruned_classifier.predict(X_train)
processedTime = time.process_time() - start
acc_DT_test = accuracy_score(y_test, predictions_pruned)
acc_DT_train = accuracy_score(y_train, predictions_pruned_train)
print("Train accuracy ", acc_DT_train)
print("Test Accuracy ", acc_DT_test)
precision,recall,fscore,support=score(y_test,predictions_pruned,average='macro')
print('Precision : {}'.format(precision))
print('Recall : {}'.format(recall))
print('F-score : {}'.format(fscore))
# print classification report - test
print("classification report for test")
print(classification_report(y_test, predictions_pruned))
# print classification report - train
print("classification report for train")
print(classification_report(y_train,predictions_pruned_train))
print('processedTime : {}'.format(processedTime))
cm=metrics.confusion_matrix(y_test, predictions_pruned,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
compareDf = pd.DataFrame({
'Method':['Decision Tree - Pruned'],
'Train_accuracy': acc_DT_train,
'Test_accuracy': acc_DT_test,
'precision' : precision,
'Recall' : recall,
'F-score' : fscore,
'processedTime' : processedTime
})
compareDf = compareDf[['Method', 'Train_accuracy','Test_accuracy','precision','Recall','F-score','processedTime']]
Train accuracy 0.8117647058823529
Test Accuracy 0.795551348793185
Precision : 0.7381009043312708
Recall : 0.7413136060422294
F-score : 0.7396661122404977
classification report for test
precision recall f1-score support
0 0.86 0.86 0.86 1552
1 0.61 0.63 0.62 561
accuracy 0.80 2113
macro avg 0.74 0.74 0.74 2113
weighted avg 0.80 0.80 0.80 2113
classification report for train
precision recall f1-score support
0 0.88 0.87 0.87 3622
1 0.64 0.66 0.65 1308
accuracy 0.81 4930
macro avg 0.76 0.76 0.76 4930
weighted avg 0.81 0.81 0.81 4930
processedTime : 0.046875
compareDf
| Method | Train_accuracy | Test_accuracy | precision | Recall | F-score | processedTime | |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree - Pruned | 0.811765 | 0.795551 | 0.738101 | 0.741314 | 0.739666 | 0.046875 |
Bagging based different classifiers methods.
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
lr = LogisticRegression();
bnb = BernoulliNB()
gnb = GaussianNB()
svc = SVC()
dtc = DecisionTreeClassifier(criterion="gini")
#Since SVC take more time to complete, it is not included in the test. But as per observation it is not providing comparable good results.
base_methods=[lr, bnb, gnb, dtc]
for bm in base_methods:
start = time.process_time()
print("Method: ", bm)
bag_model=BaggingClassifier(base_estimator=bm, n_estimators=100, bootstrap=True)
bag_model=bag_model.fit(X_train,y_train)
y_predict = bag_model.predict(X_test)
y_predict_train = bag_model.predict(X_train)
processedTime = time.process_time() - start
acc_test = accuracy_score(y_test, y_predict)
acc_train = accuracy_score(y_train, y_predict_train)
print("Macro result for test data")
precision,recall,fscore,support=score(y_test,y_predict,average='macro')
print('Precision : {}'.format(precision))
print('Recall : {}'.format(recall))
print('F-score : {}'.format(fscore))
# print classification report - test
print("classification report for test")
print(classification_report(y_test, y_predict))
# print classification report - train
print("classification report for train")
print(classification_report(y_train,y_predict_train))
print("performance on train data #1" , acc_train) # performance on train data
print("performance on test data #1" , acc_test) # performance on test data
tempcompareDf = pd.DataFrame({
'Method':[bm],
'Train_accuracy': acc_train,
'Test_accuracy': acc_test,
'precision' : precision,
'Recall' : recall,
'F-score' : fscore,
'processedTime' : processedTime
})
tempcompareDf = tempcompareDf[['Method', 'Train_accuracy','Test_accuracy', 'precision','Recall','F-score','processedTime']]
tempcompareDf
compareDf = pd.concat([compareDf, tempcompareDf])
Method: LogisticRegression()
Macro result for test data
Precision : 0.7510930643045681
Recall : 0.7137073432934561
F-score : 0.727805863572544
classification report for test
precision recall f1-score support
0 0.84 0.90 0.87 1552
1 0.66 0.52 0.59 561
accuracy 0.80 2113
macro avg 0.75 0.71 0.73 2113
weighted avg 0.79 0.80 0.79 2113
classification report for train
precision recall f1-score support
0 0.84 0.90 0.87 3622
1 0.64 0.52 0.58 1308
accuracy 0.80 4930
macro avg 0.74 0.71 0.72 4930
weighted avg 0.79 0.80 0.79 4930
performance on train data #1 0.7965517241379311
performance on test data #1 0.8026502602934217
Method: BernoulliNB()
Macro result for test data
Precision : 0.6512015332448768
Recall : 0.5588970358527666
F-score : 0.550852591254516
classification report for test
precision recall f1-score support
0 0.76 0.95 0.84 1552
1 0.54 0.17 0.26 561
accuracy 0.74 2113
macro avg 0.65 0.56 0.55 2113
weighted avg 0.70 0.74 0.69 2113
classification report for train
precision recall f1-score support
0 0.76 0.95 0.84 3622
1 0.53 0.17 0.26 1308
accuracy 0.74 4930
macro avg 0.64 0.56 0.55 4930
weighted avg 0.70 0.74 0.69 4930
performance on train data #1 0.7397565922920892
performance on test data #1 0.7415996213913867
Method: GaussianNB()
Macro result for test data
Precision : 0.7123717568938919
Recall : 0.7494934946799714
F-score : 0.7225531287141973
classification report for test
precision recall f1-score support
0 0.89 0.78 0.83 1552
1 0.54 0.72 0.62 561
accuracy 0.76 2113
macro avg 0.71 0.75 0.72 2113
weighted avg 0.79 0.76 0.77 2113
classification report for train
precision recall f1-score support
0 0.89 0.76 0.82 3622
1 0.53 0.74 0.62 1308
accuracy 0.76 4930
macro avg 0.71 0.75 0.72 4930
weighted avg 0.80 0.76 0.77 4930
performance on train data #1 0.7563894523326572
performance on test data #1 0.7624230951254141
Method: DecisionTreeClassifier()
Macro result for test data
Precision : 0.7208750560950417
Recall : 0.6910708050792951
F-score : 0.7024185642641597
classification report for test
precision recall f1-score support
0 0.83 0.89 0.86 1552
1 0.61 0.50 0.55 561
accuracy 0.78 2113
macro avg 0.72 0.69 0.70 2113
weighted avg 0.77 0.78 0.77 2113
classification report for train
precision recall f1-score support
0 1.00 1.00 1.00 3622
1 1.00 0.99 1.00 1308
accuracy 1.00 4930
macro avg 1.00 1.00 1.00 4930
weighted avg 1.00 1.00 1.00 4930
performance on train data #1 0.9975659229208925
performance on test data #1 0.7827733080927591
compareDf
| Method | Train_accuracy | Test_accuracy | precision | Recall | F-score | processedTime | |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree - Pruned | 0.811765 | 0.795551 | 0.738101 | 0.741314 | 0.739666 | 0.046875 |
| 0 | LogisticRegression() | 0.796552 | 0.802650 | 0.751093 | 0.713707 | 0.727806 | 20.515625 |
| 0 | BernoulliNB() | 0.739757 | 0.741600 | 0.651202 | 0.558897 | 0.550853 | 1.921875 |
| 0 | GaussianNB() | 0.756389 | 0.762423 | 0.712372 | 0.749493 | 0.722553 | 0.968750 |
| 0 | DecisionTreeClassifier() | 0.997566 | 0.782773 | 0.720875 | 0.691071 | 0.702419 | 2.437500 |
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score
start = time.process_time()
abcl = AdaBoostClassifier( n_estimators= 100, learning_rate=0.1, random_state=22)
abcl = abcl.fit(X_train, y_train)
y_predict = abcl.predict(X_test)
y_predict_train = abcl.predict(X_train)
processedTime = time.process_time() - start
acc_test = accuracy_score(y_test, y_predict)
acc_train = accuracy_score(y_train, y_predict_train)
print("Macro result for test data")
precision,recall,fscore,support=score(y_test,y_predict,average='macro')
print('Precision : {}'.format(precision))
print('Recall : {}'.format(recall))
print('F-score : {}'.format(fscore))
# print classification report - test
print("classification report for test")
print(classification_report(y_test, y_predict))
# print classification report - train
print("classification report for train")
print(classification_report(y_train,y_predict_train))
print("performance on train data" , acc_train) # performance on train data
print("performance on test data" , acc_test) # performance on test data
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
tempcompareDf = pd.DataFrame({
'Method': ['AdaBoosting'],
'Train_accuracy': acc_train,
'Test_accuracy': acc_test,
'precision' : precision,
'Recall' : recall,
'F-score' : fscore,
'processedTime' : processedTime
})
tempcompareDf = tempcompareDf[['Method', 'Train_accuracy','Test_accuracy', 'precision','Recall','F-score','processedTime']]
tempcompareDf
compareDf = pd.concat([compareDf, tempcompareDf])
Macro result for test data
Precision : 0.7693476820891595
Recall : 0.7045592370031424
F-score : 0.7249434649851141
classification report for test
precision recall f1-score support
0 0.83 0.93 0.88 1552
1 0.71 0.48 0.57 561
accuracy 0.81 2113
macro avg 0.77 0.70 0.72 2113
weighted avg 0.80 0.81 0.80 2113
classification report for train
precision recall f1-score support
0 0.83 0.92 0.87 3622
1 0.68 0.47 0.56 1308
accuracy 0.80 4930
macro avg 0.75 0.70 0.71 4930
weighted avg 0.79 0.80 0.79 4930
performance on train data 0.8002028397565923
performance on test data 0.8092759110269758
from sklearn.ensemble import GradientBoostingClassifier
start = time.process_time()
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22)
gbcl = gbcl.fit(X_train, y_train)
y_predict = gbcl.predict(X_test)
y_predict_train = gbcl.predict(X_train)
processedTime = time.process_time() - start
acc_test = accuracy_score(y_test, y_predict)
acc_train = accuracy_score(y_train, y_predict_train)
print("Macro result for test data")
precision,recall,fscore,support=score(y_test,y_predict,average='macro')
print('Precision : {}'.format(precision))
print('Recall : {}'.format(recall))
print('F-score : {}'.format(fscore))
# print classification report - test
print("classification report for test")
print(classification_report(y_test, y_predict))
# print classification report - train
print("classification report for train")
print(classification_report(y_train,y_predict_train))
print("performance on train data" , acc_train) # performance on train data
print("performance on test data" , acc_test) # performance on test data
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
tempcompareDf = pd.DataFrame({
'Method': ['GradientBoosting'],
'Train_accuracy': acc_train,
'Test_accuracy': acc_test,
'precision' : precision,
'Recall' : recall,
'F-score' : fscore,
'processedTime' : processedTime
})
tempcompareDf = tempcompareDf[['Method', 'Train_accuracy','Test_accuracy', 'precision','Recall','F-score','processedTime']]
tempcompareDf
compareDf = pd.concat([compareDf, tempcompareDf])
Macro result for test data
Precision : 0.760903958739894
Recall : 0.7149747551316684
F-score : 0.7314633260711692
classification report for test
precision recall f1-score support
0 0.84 0.91 0.87 1552
1 0.68 0.52 0.59 561
accuracy 0.81 2113
macro avg 0.76 0.71 0.73 2113
weighted avg 0.80 0.81 0.80 2113
classification report for train
precision recall f1-score support
0 0.84 0.91 0.88 3622
1 0.69 0.53 0.60 1308
accuracy 0.81 4930
macro avg 0.77 0.72 0.74 4930
weighted avg 0.80 0.81 0.81 4930
performance on train data 0.8135902636916835
performance on test data 0.8078561287269286
from xgboost import XGBClassifier
start = time.process_time()
XGB = XGBClassifier(n_estimators=2000, eta=0.05, subsample=0.5, colsample_bytree=0.5)
XGB.fit(X_train, y_train)
y_predict = XGB.predict(X_test)
y_predict_train = XGB.predict(X_train)
processedTime = time.process_time() - start
acc_test = accuracy_score(y_test, y_predict)
acc_train = accuracy_score(y_train, y_predict_train)
print("Macro result for test data")
precision,recall,fscore,support=score(y_test,y_predict,average='macro')
print('Precision : {}'.format(precision))
print('Recall : {}'.format(recall))
print('F-score : {}'.format(fscore))
# print classification report - test
print("classification report for test")
print(classification_report(y_test, y_predict))
# print classification report - train
print("classification report for train")
print(classification_report(y_train,y_predict_train))
print("performance on train data" , acc_train) # performance on train data
print("performance on test data" , acc_test) # performance on test data
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
tempcompareDf = pd.DataFrame({
'Method': ['XGBClassifier'],
'Train_accuracy': acc_train,
'Test_accuracy': acc_test,
'precision' : precision,
'Recall' : recall,
'F-score' : fscore,
'processedTime' : processedTime
})
tempcompareDf = tempcompareDf[['Method', 'Train_accuracy','Test_accuracy', 'precision','Recall','F-score','processedTime']]
tempcompareDf
compareDf = pd.concat([compareDf, tempcompareDf])
[22:05:22] WARNING: ..\src\learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Macro result for test data
Precision : 0.7110747337407362
Recall : 0.6820076906113898
F-score : 0.6929278237564631
classification report for test
precision recall f1-score support
0 0.82 0.88 0.85 1552
1 0.60 0.48 0.53 561
accuracy 0.78 2113
macro avg 0.71 0.68 0.69 2113
weighted avg 0.76 0.78 0.77 2113
classification report for train
precision recall f1-score support
0 0.99 0.99 0.99 3622
1 0.97 0.97 0.97 1308
accuracy 0.99 4930
macro avg 0.98 0.98 0.98 4930
weighted avg 0.99 0.99 0.99 4930
performance on train data 0.9858012170385395
performance on test data 0.7761476573592049
from sklearn.ensemble import RandomForestClassifier
start = time.process_time()
# rfcl = RandomForestClassifier(n_estimators = 15, random_state=1,max_features=12, max_depth = 3,max_features = "auto")
rfcl = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=5, min_samples_split=2,
min_samples_leaf=1,random_state=22)
rfcl = rfcl.fit(X_train, y_train)
y_predict = rfcl.predict(X_test)
y_predict_train = rfcl.predict(X_train)
processedTime = time.process_time() - start
acc_test = accuracy_score(y_test, y_predict)
acc_train = accuracy_score(y_train, y_predict_train)
print("Macro result for test data")
precision,recall,fscore,support=score(y_test,y_predict,average='macro')
print('Precision : {}'.format(precision))
print('Recall : {}'.format(recall))
print('F-score : {}'.format(fscore))
# print classification report - test
print("classification report for test")
print(classification_report(y_test, y_predict))
# print classification report - train
print("classification report for train")
print(classification_report(y_train,y_predict_train))
print("performance on train data" , acc_train) # performance on train data
print("performance on test data" , acc_test) # performance on test data
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
tempcompareDf = pd.DataFrame({
'Method': ['RandomForestClassifier'],
'Train_accuracy': acc_train,
'Test_accuracy': acc_test,
'precision' : precision,
'Recall' : recall,
'F-score' : fscore,
'processedTime' : processedTime
})
tempcompareDf = tempcompareDf[['Method', 'Train_accuracy','Test_accuracy', 'precision','Recall','F-score','processedTime']]
tempcompareDf
compareDf = pd.concat([compareDf, tempcompareDf])
Macro result for test data
Precision : 0.7631498158002457
Recall : 0.6860683472076741
F-score : 0.7073019260594653
classification report for test
precision recall f1-score support
0 0.82 0.93 0.87 1552
1 0.70 0.44 0.54 561
accuracy 0.80 2113
macro avg 0.76 0.69 0.71 2113
weighted avg 0.79 0.80 0.79 2113
classification report for train
precision recall f1-score support
0 0.82 0.93 0.87 3622
1 0.70 0.45 0.54 1308
accuracy 0.80 4930
macro avg 0.76 0.69 0.71 4930
weighted avg 0.79 0.80 0.79 4930
performance on train data 0.801419878296146
performance on test data 0.8021769995267393
compareDf
| Method | Train_accuracy | Test_accuracy | precision | Recall | F-score | processedTime | |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree - Pruned | 0.811765 | 0.795551 | 0.738101 | 0.741314 | 0.739666 | 0.046875 |
| 0 | LogisticRegression() | 0.796552 | 0.802650 | 0.751093 | 0.713707 | 0.727806 | 20.515625 |
| 0 | BernoulliNB() | 0.739757 | 0.741600 | 0.651202 | 0.558897 | 0.550853 | 1.921875 |
| 0 | GaussianNB() | 0.756389 | 0.762423 | 0.712372 | 0.749493 | 0.722553 | 0.968750 |
| 0 | DecisionTreeClassifier() | 0.997566 | 0.782773 | 0.720875 | 0.691071 | 0.702419 | 2.437500 |
| 0 | AdaBoosting | 0.800203 | 0.809276 | 0.769348 | 0.704559 | 0.724943 | 0.828125 |
| 0 | GradientBoosting | 0.813590 | 0.807856 | 0.760904 | 0.714975 | 0.731463 | 0.453125 |
| 0 | XGBClassifier | 0.985801 | 0.776148 | 0.711075 | 0.682008 | 0.692928 | 29.625000 |
| 0 | RandomForestClassifier | 0.801420 | 0.802177 | 0.763150 | 0.686068 | 0.707302 | 0.656250 |
from sklearn.model_selection import GridSearchCV
np.seterr(divide='ignore', invalid='ignore')
gbcl = GradientBoostingClassifier()
gbcl = gbcl.fit(X_train, y_train)
#Hyper Parameters Set
params = {
"learning_rate": [ 0.075, 0.1, 0.15, 0.2, 0.25],
"max_depth":[3,5,8],
"max_features":[8,9,10],
"subsample":[0.35, 0.5, 0.618, 0.8],
"n_estimators":[10,20,30]
}
#Making models with hyper parameters sets
model1 = GridSearchCV(gbcl, param_grid=params, n_jobs=1)
#Learning
model1.fit(X_train,y_train)
#The best hyper parameters set
print("Best Hyper Parameters:\n",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
prediction_train=model1.predict(X_train)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
print("Accuracy:",metrics.accuracy_score(prediction_train,y_train))
Best Hyper Parameters:
{'learning_rate': 0.2, 'max_depth': 3, 'max_features': 10, 'n_estimators': 30, 'subsample': 0.618}
Accuracy: 0.8078561287269286
Accuracy: 0.8117647058823529
import pickle
start = time.process_time()
gbcl = GradientBoostingClassifier(
learning_rate = 0.2, max_depth = 3, max_features = 10, n_estimators = 30, subsample = 0.618
)
gbcl = gbcl.fit(X_train, y_train)
# save the model to disk
filename = 'GBClassifier.sav'
pickle.dump(gbcl, open(filename, 'wb'))
import tkinter as tk
from tkinter import *
from tkinter import ttk
from tkinter.messagebox import showinfo
tenurVar=""
mchrVar=""
tchrVar=""
class predictInput(object):
def __init__(self):
self.Gender_Male=0
self.Gender_Female=0
self.genderVar = ""
self.seniorCitizenVar = 0
self.partnerVar = ""
self.dependentVar = ""
self.phoneServiceVar = ""
self.multipleLinesVar = ""
self.internetServiceVar = ""
self.onlineSecurityVar = ""
self.onlineBackupVar = ""
self.deviceProtectionVar = ""
self.techSupportVar = ""
self.streamingTVVar = ""
self.streamingMoviesVar = ""
self.contractVar = ""
self.paperlessBillingVar = ""
self.PaymentMethod_Bank_transfer_automatic = ""
self.PaymentMethod_Credit_card_automatic = ""
self.PaymentMethod_Electronic_Check = ""
self.PaymentMethod_Mailed_Check = ""
def setGender(self, gcArgs):
self.genderVar=gcArgs
def setSeniorCitizen(self, scArgs):
self.seniorCitizenVar
if(scArgs == 'Yes'):
self.seniorCitizenVar=1
if(scArgs == 'No'):
self.seniorCitizenVar=0
def setPartner(self, pArgs):
self.partnerVar=pArgs
def setDependent(self, dArgs):
self.dependentVar=dArgs
def setPhoneService(self, psArgs):
self.phoneServiceVar=psArgs
def setMultipleLines(self, mlArgs):
self.multipleLinesVar=mlArgs
def setInternetService(self, isArgs):
self.internetServiceVar=isArgs
def setOnlineSecurity(self, osArgs):
self.onlineSecurityVar=osArgs
def setOnlineBackup(self, obArgs):
self.onlineBackupVar=obArgs
def setDeviceProtection(self, dpArgs):
self.deviceProtectionVar=dpArgs
def setTechSupport(self, tsArgs):
self.techSupportVar=tsArgs
def setStreamingTV(self, stvArgs):
self.streamingTVVar=stvArgs
def setStreamingMovies(self, stmArgs):
self.streamingMoviesVar=stmArgs
def setContract(self, cArgs):
self.contractVar=cArgs
def setPaperlessBilling(self, pbArgs):
self.paperlessBillingVar=pbArgs
def setPaymentMethod(self, pmArgs):
self.paymentMethodVar=pmArgs
def getbooleanOutput(self,i):
switcher={
"No":2,
"Yes":1
}
return switcher.get(i,"Invalid entry")
def getInternetService(self,i):
switcher={
"Yes":1,
"No":2,
"No internet service":3
}
return switcher.get(i,"Invalid entry")
def getPhoneService(self,i):
switcher={
"Yes":1,
"No":2,
"No phone service":3
}
return switcher.get(i,"Invalid entry")
def getInternetServiceType(self,i):
switcher={
"No":3,
"DSL":2,
"Fiber optic":1
}
return switcher.get(i,"Invalid entry")
def getContractType(self,i):
switcher={
"Month-to-month":3,
"One year":2,
"Two year":1
}
return switcher.get(i,"Invalid entry")
def getpaymentMethodType(self,inp):
if(inp == 'Electronic check'):
self.PaymentMethod_Bank_transfer_automatic = 0
self.PaymentMethod_Credit_card_automatic = 0
self.PaymentMethod_Electronic_Check = 1
self.PaymentMethod_Mailed_Check = 0
if(inp == 'Mailed check'):
self.PaymentMethod_Bank_transfer_automatic = 0
self.PaymentMethod_Credit_card_automatic = 0
self.PaymentMethod_Electronic_Check = 0
self.PaymentMethod_Mailed_Check = 1
if(inp == 'Bank transfer (automatic)'):
self.PaymentMethod_Bank_transfer_automatic = 1
self.PaymentMethod_Credit_card_automatic = 0
self.PaymentMethod_Electronic_Check = 0
self.PaymentMethod_Mailed_Check = 0
if(inp == 'Credit card (automatic)'):
self.PaymentMethod_Bank_transfer_automatic = 0
self.PaymentMethod_Credit_card_automatic = 1
self.PaymentMethod_Electronic_Check = 0
self.PaymentMethod_Mailed_Check = 0
def getGenderType(self,genderArgs):
if(genderArgs == 'Male'):
self.Gender_Male=1
self.Gender_Female=0
if(genderArgs == 'Female'):
self.Gender_Male=0
self.Gender_Female=1
def UpdateVariable(self):
if(self.genderVar and self.partnerVar and self.dependentVar and self.phoneServiceVar
and self.multipleLinesVar and self.internetServiceVar and self.onlineSecurityVar and self.onlineBackupVar
and self.deviceProtectionVar and self.techSupportVar and self.streamingTVVar and self.streamingMoviesVar
and self.contractVar and self.paperlessBillingVar and self.paymentMethodVar):
print("genderVar :: ", self.genderVar , " ", self.getGenderType(self.genderVar), " "
,self.Gender_Male ," " ,self.Gender_Female)
print("seniorCitizenVar :: ", self.seniorCitizenVar)
print("partnerVar :: ", self.partnerVar, " " , self.getbooleanOutput(self.partnerVar))
print("dependentVar :: ", self.dependentVar, " " , self.getbooleanOutput(self.dependentVar))
print("phoneService :: ", self.phoneServiceVar," ", self.getPhoneService(self.phoneServiceVar))
print("multipleLines :: ", self.multipleLinesVar," ", self.getPhoneService(self.multipleLinesVar))
print("internetService :: ", self.internetServiceVar, " " , self.getInternetServiceType(self.internetServiceVar))
print("onlineSecurity :: ", self.onlineSecurityVar, " " , self.getInternetService(self.onlineSecurityVar))
print("onlineBackup :: ", self.onlineBackupVar, " " , self.getInternetService(self.onlineBackupVar))
print("deviceProtection :: ", self.deviceProtectionVar, " " , self.getInternetService(self.deviceProtectionVar))
print("techSupport :: ", self.techSupportVar, " " , self.getInternetService(self.techSupportVar))
print("streamingTV :: ", self.streamingTVVar, " " , self.getInternetService(self.streamingTVVar))
print("streamingMovies :: ", self.streamingMoviesVar, " " , self.getInternetService(self.streamingMoviesVar))
print("contract :: ", self.contractVar , " ", self.getContractType(self.contractVar))
print("paperlessBilling :: ", self.paperlessBillingVar, " ", self.getbooleanOutput(self.paperlessBillingVar))
print("paymentMethod :: "
, self.paymentMethodVar ," ", self.getpaymentMethodType(self.paymentMethodVar)
, self.PaymentMethod_Bank_transfer_automatic ," "
, self.PaymentMethod_Credit_card_automatic ," "
, self.PaymentMethod_Electronic_Check," "
, self.PaymentMethod_Mailed_Check)
print("Tenure ", tenurVar)
print("Monthly Charges", mchrVar)
print("Total Charegs" ,tchrVar)
variableListInput=[self.seniorCitizenVar,
self.getbooleanOutput(self.partnerVar),
self.getbooleanOutput(self.dependentVar),
self.getPhoneService(self.multipleLinesVar),
self.getInternetServiceType(self.internetServiceVar),
self.getInternetService(self.onlineSecurityVar),
self.getContractType(self.contractVar),
self.getbooleanOutput(self.paperlessBillingVar),
self.PaymentMethod_Bank_transfer_automatic ,
self.PaymentMethod_Credit_card_automatic ,
self.PaymentMethod_Electronic_Check,
self.PaymentMethod_Mailed_Check,
mchrVar,
tchrVar,
tenurVar
]
for i in range(0, len(variableListInput)):
variableListInput[i] = int(variableListInput[i])
loaded_model = pickle.load(open(filename, 'rb'))
arr = np.array(variableListInput)
arr_2d = np.reshape(arr, (-1, 15))
output = loaded_model.predict(arr_2d)
if(output == 1):
tk.messagebox.showinfo("Predicted Output","Customer would Churn")
if(output == 0):
tk.messagebox.showinfo("Predicted Output","Customer would not Churn")
else :
msg = "Please provide inputs of all fields"
tk.messagebox.showwarning(title="Field input", message=msg)
def main():
model = predictInput()
root = Tk()
root.title("GUI")
root.geometry('800x700')
app = Frame(root)
app.grid()
canvas = Canvas(app, width = 300, height = 75)
canvas.pack(side = "left")
ttk.Label(root, text = "Gender ",font = ("Times New Roman", 13)).grid(column = 0,row = 1, padx = 1, pady = 3)
ttk.Label(root, text = "Senior Citizen ",font = ("Times New Roman", 13)).grid(column = 0,row = 2, padx = 1, pady = 3)
ttk.Label(root, text = "Partner",font = ("Times New Roman", 13)).grid(column = 0,row = 3, padx = 1, pady = 3)
ttk.Label(root, text = "Dependents",font = ("Times New Roman", 13)).grid(column = 0,row = 4, padx = 1, pady = 3)
ttk.Label(root, text = "tenure",font = ("Times New Roman", 13)).grid(column = 0,row = 5, padx = 1, pady = 3)
ttk.Label(root, text = "PhoneService",font = ("Times New Roman", 13)).grid(column = 0,row = 6, padx = 1, pady = 3)
ttk.Label(root, text = "MultipleLines",font = ("Times New Roman", 13)).grid(column = 0,row = 7, padx = 1, pady = 3)
ttk.Label(root, text = "InternetService",font = ("Times New Roman", 13)).grid(column = 0,row = 8, padx = 1, pady = 3)
ttk.Label(root, text = "OnlineSecurity",font = ("Times New Roman", 13)).grid(column = 0,row = 9, padx = 1, pady = 3)
ttk.Label(root, text = "OnlineBackup",font = ("Times New Roman", 13)).grid(column = 0,row = 10, padx = 1, pady = 3)
ttk.Label(root, text = "DeviceProtection",font = ("Times New Roman", 13)).grid(column = 0,row = 11, padx = 1, pady = 3)
ttk.Label(root, text = "TechSupport",font = ("Times New Roman", 13)).grid(column = 0,row = 12, padx = 1, pady = 3)
ttk.Label(root, text = "StreamingTV",font = ("Times New Roman", 13)).grid(column = 0,row = 13, padx = 1, pady = 3)
ttk.Label(root, text = "StreamingMovies",font = ("Times New Roman", 13)).grid(column = 0,row = 14, padx = 1, pady = 3)
ttk.Label(root, text = "Contract",font = ("Times New Roman", 13)).grid(column = 0,row = 15, padx = 1, pady = 3)
ttk.Label(root, text = "PaperlessBilling",font = ("Times New Roman", 13)).grid(column = 0,row = 16, padx = 1, pady = 3)
ttk.Label(root, text = "PaymentMethod",font = ("Times New Roman", 13)).grid(column = 0,row = 17, padx = 1, pady = 3)
ttk.Label(root, text = "MonthlyCharges",font = ("Times New Roman", 13)).grid(column = 0,row = 18, padx = 1, pady = 3)
ttk.Label(root, text = "TotalCharges",font = ("Times New Roman", 13)).grid(column = 0,row = 19, padx = 1, pady = 3)
def Gender_changed(event):
genVar = Gender_cb.get()
model.setGender(genVar)
gender = ('Male', 'Female')
selected_gender = tk.StringVar()
Gender_cb = ttk.Combobox(root, textvariable=selected_gender)
Gender_cb['values'] = gender
Gender_cb['state'] = 'readonly' # normal
Gender_cb.grid( column = 1,row = 1,padx=5, pady=5)
Gender_cb.bind('<<ComboboxSelected>>', Gender_changed)
def Senior_Citizen_changed(event):
scVar = Senior_Citizen_cb.get()
model.setSeniorCitizen(scVar)
Senior_Citizen = ('Yes', 'No')
selected_Senior_Citizen = tk.StringVar()
Senior_Citizen_cb = ttk.Combobox(root, textvariable=selected_Senior_Citizen)
Senior_Citizen_cb['values'] = Senior_Citizen
Senior_Citizen_cb['state'] = 'readonly' # normal
Senior_Citizen_cb.grid( column = 1,row = 2,padx=5, pady=5)
Senior_Citizen_cb.bind('<<ComboboxSelected>>', Senior_Citizen_changed)
def Partner_changed(event):
pVar=Partner_cb.get()
model.setPartner(pVar)
Partner = ('Yes', 'No')
selected_Partner = tk.StringVar()
Partner_cb = ttk.Combobox(root, textvariable=selected_Partner)
Partner_cb['values'] = Partner
Partner_cb['state'] = 'readonly' # normal
Partner_cb.grid( column = 1,row = 3,padx=5, pady=5)
Partner_cb.bind('<<ComboboxSelected>>', Partner_changed)
def Dependents_changed(event):
dvar=Dependents_cb.get()
model.setDependent(dvar)
Dependents = ('Yes', 'No')
selected_Dependents = tk.StringVar()
Dependents_cb = ttk.Combobox(root, textvariable=selected_Dependents)
Dependents_cb['values'] = Dependents
Dependents_cb['state'] = 'readonly' # normal
Dependents_cb.grid( column = 1,row = 4,padx=5, pady=5)
Dependents_cb.bind('<<ComboboxSelected>>', Dependents_changed)
def PhoneService_changed(event):
psVar=PhoneService_cb.get()
model.setPhoneService(psVar)
PhoneService = ('Yes', 'No')
selected_PhoneService = tk.StringVar()
PhoneService_cb = ttk.Combobox(root, textvariable=selected_PhoneService)
PhoneService_cb['values'] = PhoneService
PhoneService_cb['state'] = 'readonly' # normal
PhoneService_cb.grid( column = 1,row = 6,padx=5, pady=5)
PhoneService_cb.bind('<<ComboboxSelected>>', PhoneService_changed)
def MultipleLines_changed(event):
mlvar=MultipleLines_cb.get()
model.setMultipleLines(mlvar)
MultipleLines = ('Yes', 'No', 'No phone service')
selected_MultipleLines = tk.StringVar()
MultipleLines_cb = ttk.Combobox(root, textvariable=selected_MultipleLines)
MultipleLines_cb['values'] = MultipleLines
MultipleLines_cb['state'] = 'readonly' # normal
MultipleLines_cb.grid( column = 1,row = 7,padx=5, pady=5)
MultipleLines_cb.bind('<<ComboboxSelected>>', MultipleLines_changed)
def InternetService_changed(event):
isVar=InternetService_cb.get()
model.setInternetService(isVar)
InternetService = ('DSL', 'Fiber optic', 'No')
selected_InternetService = tk.StringVar()
InternetService_cb = ttk.Combobox(root, textvariable=selected_InternetService)
InternetService_cb['values'] = InternetService
InternetService_cb['state'] = 'readonly' # normal
InternetService_cb.grid( column = 1,row = 8,padx=5, pady=5)
InternetService_cb.bind('<<ComboboxSelected>>', InternetService_changed)
def OnlineSecurity_changed(event):
osVar=OnlineSecurity_cb.get()
model.setOnlineSecurity(osVar)
OnlineSecurity = ('No', 'Yes', 'No internet service')
selected_OnlineSecurity = tk.StringVar()
OnlineSecurity_cb = ttk.Combobox(root, textvariable=selected_OnlineSecurity)
OnlineSecurity_cb['values'] = OnlineSecurity
OnlineSecurity_cb['state'] = 'readonly' # normal
OnlineSecurity_cb.grid( column = 1,row = 9,padx=5, pady=5)
OnlineSecurity_cb.bind('<<ComboboxSelected>>', OnlineSecurity_changed)
def OnlineBackup_changed(event):
obVar=OnlineBackup_cb.get()
model.setOnlineBackup(obVar)
OnlineBackup = ('Yes', 'No', 'No internet service')
selected_OnlineBackup = tk.StringVar()
OnlineBackup_cb = ttk.Combobox(root, textvariable=selected_OnlineBackup)
OnlineBackup_cb['values'] = OnlineBackup
OnlineBackup_cb['state'] = 'readonly' # normal
OnlineBackup_cb.grid( column = 1,row = 10,padx=5, pady=5)
OnlineBackup_cb.bind('<<ComboboxSelected>>', OnlineBackup_changed)
def DeviceProtection_changed(event):
dpVar=DeviceProtection_cb.get()
model.setDeviceProtection(dpVar)
DeviceProtection = ('Yes', 'No', 'No internet service')
selected_DeviceProtection = tk.StringVar()
DeviceProtection_cb = ttk.Combobox(root, textvariable=selected_DeviceProtection)
DeviceProtection_cb['values'] = DeviceProtection
DeviceProtection_cb['state'] = 'readonly' # normal
DeviceProtection_cb.grid( column = 1,row = 11,padx=5, pady=5)
DeviceProtection_cb.bind('<<ComboboxSelected>>', DeviceProtection_changed)
def TechSupport_changed(event):
tsVar=TechSupport_cb.get()
model.setTechSupport(tsVar)
TechSupport = ('Yes', 'No', 'No internet service')
selected_TechSupport = tk.StringVar()
TechSupport_cb = ttk.Combobox(root, textvariable=selected_TechSupport)
TechSupport_cb['values'] = TechSupport
TechSupport_cb['state'] = 'readonly' # normal
TechSupport_cb.grid( column = 1,row = 12,padx=5, pady=5)
TechSupport_cb.bind('<<ComboboxSelected>>', TechSupport_changed)
def StreamingTV_changed(event):
stvVar=StreamingTV_cb.get()
model.setStreamingTV(stvVar)
StreamingTV = ('Yes', 'No', 'No internet service')
selected_StreamingTV = tk.StringVar()
StreamingTV_cb = ttk.Combobox(root, textvariable=selected_StreamingTV)
StreamingTV_cb['values'] = StreamingTV
StreamingTV_cb['state'] = 'readonly' # normal
StreamingTV_cb.grid( column = 1,row = 13,padx=5, pady=5)
StreamingTV_cb.bind('<<ComboboxSelected>>', StreamingTV_changed)
def StreamingMovies_changed(event):
smVar=StreamingMovies_cb.get()
model.setStreamingMovies(smVar)
StreamingMovies = ('Yes', 'No', 'No internet service')
selected_StreamingMovies = tk.StringVar()
StreamingMovies_cb = ttk.Combobox(root, textvariable=selected_StreamingMovies)
StreamingMovies_cb['values'] = StreamingMovies
StreamingMovies_cb['state'] = 'readonly' # normal
StreamingMovies_cb.grid( column = 1,row = 14,padx=5, pady=5)
StreamingMovies_cb.bind('<<ComboboxSelected>>', StreamingMovies_changed)
def Contract_changed(event):
cVar=Contract_cb.get()
model.setContract(cVar)
Contract = ('Month-to-month','One year','Two year')
selected_Contract = tk.StringVar()
Contract_cb = ttk.Combobox(root, textvariable=selected_Contract)
Contract_cb['values'] = Contract
Contract_cb['state'] = 'readonly' # normal
Contract_cb.grid( column = 1,row = 15,padx=5, pady=5)
Contract_cb.bind('<<ComboboxSelected>>', Contract_changed)
def PaperlessBilling_changed(event):
pbVar=PaperlessBilling_cb.get()
model.setPaperlessBilling(pbVar)
PaperlessBilling = ('Yes','No')
selected_PaperlessBilling = tk.StringVar()
PaperlessBilling_cb = ttk.Combobox(root, textvariable=selected_PaperlessBilling)
PaperlessBilling_cb['values'] = PaperlessBilling
PaperlessBilling_cb['state'] = 'readonly' # normal
PaperlessBilling_cb.grid( column = 1,row = 16,padx=5, pady=5)
PaperlessBilling_cb.bind('<<ComboboxSelected>>', PaperlessBilling_changed)
def PaymentMethod_changed(event):
pmVar=PaymentMethod_cb.get()
model.setPaymentMethod(pmVar)
PaymentMethod = ('Electronic check','Mailed check','Bank transfer (automatic)','Credit card (automatic)')
selected_PaymentMethod = tk.StringVar()
PaymentMethod_cb = ttk.Combobox(root, textvariable=selected_PaymentMethod)
PaymentMethod_cb['values'] = PaymentMethod
PaymentMethod_cb['state'] = 'readonly' # normal
PaymentMethod_cb.grid( column = 1,row = 17,padx=5, pady=5)
PaymentMethod_cb.bind('<<ComboboxSelected>>', PaymentMethod_changed)
tenureVar=tk.DoubleVar()
def setTenurVar(value1):
global tenurVar
tenurVar=value1
Tenure_entry = tk.Entry(root)
Tenure_entry.bind("<Return>", (lambda event: setTenurVar(Tenure_entry.get())))
Tenure_entry.grid(row=5,column=1)
mchrVar=tk.DoubleVar()
def setmchrVar(value1):
global mchrVar
mchrVar=value1
MonthlyCharges_entry = tk.Entry(root)
MonthlyCharges_entry.bind("<Return>", (lambda event: setmchrVar(MonthlyCharges_entry.get())))
MonthlyCharges_entry.grid(row=18,column=1)
tchrVar=tk.DoubleVar()
def settchrVar(value1):
global tchrVar
tchrVar=value1
model.UpdateVariable()
TotalCharges_entry = tk.Entry(root)
TotalCharges_entry.bind("<Return>", (lambda event: [settchrVar(TotalCharges_entry.get())]))
TotalCharges_entry.grid(row=19,column=1)
btn_loadVar = Button(root, text="Submit",command=lambda: [setTenurVar(Tenure_entry.get()),setmchrVar(MonthlyCharges_entry.get()),settchrVar(TotalCharges_entry.get())])
btn_loadVar.grid(column = 0,row = 20)
root.mainloop()
if __name__ == "__main__":
main()
genderVar :: Male None 1 0 seniorCitizenVar :: 1 partnerVar :: Yes 1 dependentVar :: Yes 1 phoneService :: Yes 1 multipleLines :: No phone service 3 internetService :: Fiber optic 1 onlineSecurity :: No internet service 3 onlineBackup :: No 2 deviceProtection :: No 2 techSupport :: Yes 1 streamingTV :: No 2 streamingMovies :: Yes 1 contract :: Two year 1 paperlessBilling :: Yes 1 paymentMethod :: Electronic check None 0 0 1 0 Tenure 45 Monthly Charges 9 Total Charegs 99
The print statement would help us to view the user selected input.
Contract is one of very important varaible, where in customer who stay around 2 years with service tends to churn less when compared with Month-to-Month contract.
To retain customer, it is very important to have good fibre optics internet service, this factor adds volatility to the churn.